The Boston data set, on housing values in the suburbs of Boston, has 506 observations of 14 variables, of which 12 are continuous numerical variables and 2 are discrete integers. It includes several variables describing factors that might influence housing values, listed below.
-crim: per capita crime rate by town.
-zn: proportion of residential land zoned for lots over 25,000 sq.ft.
-indus: proportion of non-retail business acres per town.
-chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
-nox: nitrogen oxides concentration (parts per 10 million).
-rm: average number of rooms per dwelling.
-age: proportion of owner-occupied units built prior to 1940.
-dis: weighted mean of distances to five Boston employment centres.
-rad: index of accessibility to radial highways.
-tax: full-value property-tax rate per $10,000.
-ptratio: pupil-teacher ratio by town.
-black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
-lstat: lower status of the population (percent).
-medv: median value of owner-occupied homes in $1000s.
# load the required packages and the data from the MASS package
library(MASS)     # Boston data
library(dplyr)    # glimpse, select
library(GGally)   # ggpairs
library(corrplot) # corrplot
data("Boston")
## check how the data look like (structure and dimension etc.)
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
To explore the variables and their relationships further, I created plots of their distributions and pairwise correlations.
# Plots
p <- ggpairs(Boston, mapping = aes(alpha = 0.5), lower = list(combo = wrap("facethist", bins = 20)))
p1 <- p + ggtitle("Housing values in suburbs of Boston")
p1
Figure 1. Boston data variables
# correlation plot
cor_matrix <- cor(Boston) %>% round(2)
corrplot(cor_matrix, method = "circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Figure 2. Correlations
The distribution plots in Figure 1 show that almost none of the variables follow a normal distribution. Only one, the average number of rooms per dwelling (rm), appears approximately normal. The other variables are either right-skewed (crim, zn, chas, nox, dis, lstat and medv), left-skewed (age, ptratio and black) or clearly bimodal (indus, rad and tax). There also seem to be strong correlations between many of the variables. The correlations are plotted in Figure 2, where red indicates negative correlation and blue positive correlation; the size and color intensity of each circle show the strength of the correlation, with larger and darker circles indicating stronger correlations. For example, lower status of the population (lstat) and median value of owner-occupied homes (medv) have a strong negative correlation, whereas the index of accessibility to radial highways (rad) and the full-value property-tax rate (tax) have a strong positive correlation. Whether these correlations are statistically significant can be confirmed by statistical tests.
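As a sketch of such a test (an addition, not part of the original analysis), Pearson's correlation test via cor.test could be applied to the lstat-medv pair:

```r
# Hypothetical significance check: test whether the correlation
# between lstat and medv differs from zero.
library(MASS)
data("Boston")
ct <- cor.test(Boston$lstat, Boston$medv)
ct$estimate  # Pearson correlation, about -0.74
ct$p.value   # far below 0.05, so the correlation is statistically significant
```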
## summary of the data
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
The skewness of the variables can also be observed in the summary statistics above. For crim, the interval between the minimum 0.00632 and the maximum 88.98 is wide, yet the median is only 0.256. The results are difficult to interpret because the variables are on very different scales, so the data should be standardized (below). After standardization each variable has mean 0 and unit variance, so the results are comparable with each other and easier to interpret.
# standardize data and print out the summary
boston_scaled <- as.data.frame(scale(Boston))
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
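As a quick sanity check (an addition, not part of the original analysis), scale() with default arguments is equivalent to subtracting the column mean and dividing by the column standard deviation:

```r
# Verify by hand that scale() computes (x - mean(x)) / sd(x),
# shown here for the crim column.
library(MASS)
data("Boston")
manual <- (Boston$crim - mean(Boston$crim)) / sd(Boston$crim)
scaled <- as.data.frame(scale(Boston))$crim
all.equal(manual, scaled)  # TRUE
```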
Linear discriminant analysis (LDA) is a classification method that can be used to find the variables that best separate classes, to predict the classes of new data, and for dimensionality reduction. The target variable can be binary or multiclass. LDA assumes that the variables are normally distributed and that each class shares the same covariance matrix. To fit the model with crime rate as the target, a categorical crime variable is created from the quantiles of crim, and the data are divided into train and test sets.
# save the scaled crim vector (used below to create a categorical variable)
scaled_crim <- boston_scaled$crim
# summary of the scaled_crim
summary(scaled_crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419400 -0.410600 -0.390300 0.000000 0.007389 9.924000
# create a quantile vector of crim to use as cut points
bins <- quantile(scaled_crim)
# create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
# Divide dataset to test and train sets
n <- nrow(boston_scaled)
# choose randomly 80% of the rows (without set.seed the split varies between runs)
ind <- sample(n, size = n * 0.8)
# train set
train <- boston_scaled[ind,]
# and test set
test <- boston_scaled[-ind,]
# linear discriminant analysis on the train set
lda.fit <- lda(crime ~., data = train)
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2599010 0.2574257 0.2450495 0.2376238
##
## Group means:
## zn indus chas nox rm
## low 0.9409768 -0.9015249 -0.084848104 -0.8736702 0.4247167
## med_low -0.1135082 -0.2576328 -0.007331936 -0.5449535 -0.1088236
## med_high -0.3729012 0.1737975 0.165126514 0.3953306 0.1167668
## high -0.4872402 1.0172418 -0.108283225 1.0659835 -0.4271663
## age dis rad tax ptratio
## low -0.8550002 0.8598096 -0.6996764 -0.7314427 -0.44708494
## med_low -0.3206475 0.2754719 -0.5478832 -0.4510342 -0.05940589
## med_high 0.3819466 -0.3929399 -0.4227186 -0.3212120 -0.32052420
## high 0.8183463 -0.8653103 1.6368728 1.5131579 0.77931510
## black lstat medv
## low 0.38593802 -0.75884629 0.516783100
## med_low 0.31131803 -0.11247059 0.009815111
## med_high 0.07455151 0.01044852 0.211476508
## high -0.90674677 0.93780690 -0.709518409
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.10421979 0.6963638900 -0.87623981
## indus 0.03977506 -0.2537820201 0.29710233
## chas -0.04075613 -0.0047725687 0.07528764
## nox 0.44539291 -0.8724281106 -1.41063077
## rm 0.04033224 -0.0648843955 -0.10293019
## age 0.22330544 -0.1891520043 -0.18449982
## dis -0.06621316 -0.2482051644 -0.03634122
## rad 3.41785854 0.9510175825 -0.21895872
## tax 0.00065747 0.0008216975 0.68946457
## ptratio 0.13003811 -0.0045906986 -0.25369643
## black -0.11288449 0.0338764238 0.15261219
## lstat 0.23258159 -0.2542925956 0.42029270
## medv 0.12818572 -0.5198363264 -0.16823018
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9552 0.0334 0.0113
# Plot
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col=classes, pch=classes, main="LDA")
lda.arrows(lda.fit, myscale = 2)
## Warning in arrows(x0 = 0, y0 = 0, x1 = myscale * heads[, choices[1]], y1 =
## myscale * : zero-length arrow is of indeterminate angle and so skipped
Figure 3. LDA
Figure 3 shows the LDA results in the first two discriminant dimensions, in which different colors represent different classes and the arrows show the impact of each predictor variable. LD1 alone captures about 96% of the between-group variance (proportion of trace above). Accessibility to radial highways (rad) seems to be strongly associated with high crime rates.
Classes can be predicted from the LDA model using the test data, and this information can be used to assess the performance of the model. The correct values are cross-tabulated against the predicted values.
# save test crime
correct_classes <- test$crime
## test set for prediction
test <- dplyr::select(test, -crime)
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# and cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 15 6 1 0
## med_low 4 16 2 0
## med_high 1 11 13 2
## high 0 0 0 31
The model performs well with the high class: all 31 predictions are correct. Of the observations in the low class, 15 are predicted correctly, 6 as medium low and 1 as medium high. Medium low is predicted correctly 16 times and incorrectly 6 times. Medium high is the weakest class, with only 13 of 27 observations predicted correctly; it is most often confused with medium low. Overall, the model performs well for all classes except medium high.
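Overall accuracy can be summarized from the cross table by summing its diagonal; a sketch using the table printed above (the exact figures depend on the random train/test split):

```r
# Recompute overall accuracy from the confusion table above:
# correct predictions (diagonal) divided by all predictions.
conf <- matrix(c(15,  6,  1,  0,
                  4, 16,  2,  0,
                  1, 11, 13,  2,
                  0,  0,  0, 31),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))
sum(diag(conf)) / sum(conf)  # 75 / 102, about 0.74
```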
K-means is a clustering method that groups objects by their similarity. The number of clusters that best fits the data can be determined, for example, by looking at how the total within-cluster sum of squares (WCSS) behaves as the number of clusters changes. The optimal number can be read from Figure 4 at the point where the WCSS drops sharply; here this happens at 2 clusters.
data("Boston")
#scale
boston_scaled <- as.data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(boston_scaled)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4620 4.8240 4.9110 6.1860 14.4000
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results to find optimal number of clusters
plot(1:k_max, twcss, type='b')
Figure 4. K-means clustering
# Plot
# k-means clustering
km <- kmeans(dist_eu, centers = 2)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
Figure 5. Variables by two cluster
In Figure 5, the original (unscaled) variables are plotted against each other, colored by cluster. A difference between the two clusters can be seen in almost all panels. There is a clear difference, for example, in crime rate and in accessibility to radial highways, which was also seen in the LDA plot.
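To see which variables drive the separation, the standardized cluster centers can be inspected. A sketch, assuming k-means is run directly on the scaled data with a fixed seed (an assumption; the analysis above clustered the distance matrix, and cluster labels may swap between runs):

```r
# Inspect standardized cluster centers for a 2-cluster solution.
# Assumption: kmeans on the scaled data (not the distance matrix as
# above), with set.seed for reproducibility.
library(MASS)
data("Boston")
set.seed(123)
boston_scaled <- as.data.frame(scale(Boston))
km <- kmeans(boston_scaled, centers = 2)
# Centers far from zero mark variables that separate the clusters,
# e.g. crim, rad and tax.
round(km$centers[, c("crim", "rad", "tax")], 2)
```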